(5/N) Read database access records on boot #8925
```rust
DatastoreSetupAction::NeedsHandoff { nexus_id } => {
    info!(log, "Datastore is awaiting handoff");

    datastore.attempt_handoff(*nexus_id).await.map_err(
```
This is implemented in #8932
"Could not handoff to new nexus"; | ||
err | ||
); | ||
BackoffError::transient( |
Are all the error cases here transient? Looking at #8932, maybe `NexusInWrongState` is a permanent error?
I think even in this case, I do want us to retry.

Right now, it's actually not possible for us to return from the `NexusInWrongState` case:

- It's returned from a transaction that checks "all nexus state records are either `quiesced` or `not_yet`"
- ... and it's returned when our record specifically is `!= not_yet`, therefore it must be `quiesced`.
- ... but if our record is `quiesced`, the previous call to `check_schema_and_access` would have returned `Refuse`, rather than `NeedsHandoff`
- ... further, the transition from `active` to `quiesced` shouldn't be racy (without operator intervention) because each Nexus should be responsible for performing this transition for itself.

So, TL;DR:

- This check is currently defensive, and not possible without someone poking db records manually
- When we retry, we retry calling `check_schema_and_access` again, before re-trying `attempt_handoff`
- If we do that, while also being `quiesced`, we'll converge to "locked out of the db"
- ... BUT ALSO in the future, or in the face of weird concurrent events, IMO the decision to "restart all checks" doesn't seem like a bad choice.

I think that seeing a weird state in one of these branches, and choosing to "re-evaluate the world again from scratch", seems like a reasonable choice. `check_schema_and_access` should be able to determine a reasonable next step, and it only throws errors if we cannot access the database (which I consider to be transient).
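To make that concrete, here's a rough sketch of the retry flow as I understand it from this thread. `DatastoreSetupAction`, `check_schema_and_access`, and `attempt_handoff` are names from this PR, but the types, signatures, and the plain sleep-based loop are simplified stand-ins; the real code uses the backoff machinery with `BackoffError::transient`, so this is not the actual implementation:

```rust
// Rough sketch only: names mirror this PR, but everything here is a
// simplified stand-in for the real backoff-based setup code.
use std::time::Duration;

#[derive(Debug)]
enum DatastoreSetupAction {
    // Schema is current and our access record lets us proceed.
    Ready,
    // An old Nexus still owns the database; try to take over.
    NeedsHandoff { nexus_id: u64 },
    // Our own record is already `quiesced`: we are locked out.
    Refuse,
}

#[derive(Debug)]
struct TransientError(&'static str);

// Stand-in: re-evaluates the schema version and access records from scratch.
async fn check_schema_and_access() -> Result<DatastoreSetupAction, TransientError> {
    Ok(DatastoreSetupAction::Ready)
}

// Stand-in: attempts the handoff to this (new) Nexus.
async fn attempt_handoff(_nexus_id: u64) -> Result<(), TransientError> {
    Ok(())
}

async fn wait_for_datastore() -> Result<(), &'static str> {
    loop {
        // Every retry starts over here, so a weird state observed during
        // handoff (e.g. `NexusInWrongState`) makes us re-evaluate the world
        // from scratch rather than give up.
        match check_schema_and_access().await {
            Ok(DatastoreSetupAction::Ready) => return Ok(()),
            Ok(DatastoreSetupAction::Refuse) => {
                // Converges to "locked out of the db".
                return Err("refusing to use the datastore");
            }
            Ok(DatastoreSetupAction::NeedsHandoff { nexus_id }) => {
                match attempt_handoff(nexus_id).await {
                    Ok(()) => return Ok(()),
                    // Treated as transient: fall through and retry.
                    Err(err) => eprintln!("could not handoff to new nexus: {err:?}"),
                }
            }
            // Could not reach the database at all: also transient.
            Err(err) => eprintln!("cannot check schema / access: {err:?}"),
        }
        tokio::time::sleep(Duration::from_secs(1)).await;
    }
}

#[tokio::main]
async fn main() {
    wait_for_datastore().await.expect("datastore setup failed");
}
```

The key point is that the `match` sits inside the retry loop, so `check_schema_and_access` runs again before any repeated `attempt_handoff`.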
```rust
warn!(
    log,
    "Cannot check schema version / Nexus access";
    "error" => InlineErrorChain::new(err.as_ref()),
```
Tiny nit - `InlineErrorChain` implements `slog::KV` and provides the `"error"` key, so this can be shortened to:

```diff
-"error" => InlineErrorChain::new(err.as_ref()),
+InlineErrorChain::new(err.as_ref()),
```
(Same nit below, but also I don't feel strongly about this so if you prefer the explicit key, feel free to leave them.)
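For reference, the full call from the snippet above with that suggestion applied would read roughly like this (same behavior, since the crate supplies the key itself):

```rust
warn!(
    log,
    "Cannot check schema version / Nexus access";
    // `InlineErrorChain` implements `slog::KV` and provides the "error" key.
    InlineErrorChain::new(err.as_ref()),
);
```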
Patched in bbb204c
```diff
@@ -108,11 +110,40 @@ async fn main_impl() -> anyhow::Result<()> {
     }
     Cmd::Upgrade { version } => {
```
Not necessarily on this PR - should we do more to discourage the use of this tool? If something goes wrong with handoffs we'll need it, but in general we shouldn't run this anymore (once the whole stack of work lands), right?
totally agreed - I think the main usage of this is from https://github.com/oxidecomputer/meta/blob/master/engineering/dogfood/overview.adoc , which we can and should patch once we're ready to go to the full "online update" world for Nexus.
The changes here look good. I want to check my understanding of the risk and impact of this. During the next MUPdate from, say, R16, in chronological order:
If there's a subsequent blueprint execution, it won't change the records (assuming no zones have been added or expunged). So again, for all subsequent restarts of Nexus after that point, they'll go through that same flow. If we MUPdate again after that:
Subsequent MUPdates are like that one until we change the software further (e.g., to have Nexus do schema updates). Is that all correct? What, if any of that, do you think we want to test manually before landing this?
Correct, a Nexus which upgrades and has an old schema will run Without #8912, this will fail, and hit the retry loop.
Yup.
Yeah, populating these records should come from
Even if those zones are added or expunged, the records should be updated as a part of #8924. But if the set of zones stays the same, these records should stay the same too.
Yup, this matches my understanding.
WDYT about a deployment of:
(I can do that manually, or we could merge this and do it as a part of the dogfood rollout - either way would be fine with me)
By "the old path for schema update still works", you mean basically doing this?
That seems pretty good. If it's easy enough to try before landing this, that seems worthwhile.
I'm using rkadm on berlin to test this right now, with:

```console
$ /opt/rackletteadm/bin/rkadm --toml /opt/rackletteadm/configs/berlin/berlin-rkadm.toml run \
    --commit-a-install 032c5569fda856de03de0a5db1f4617351200c01 \
    --commit-b-upgrade fed7aa07df18d4f7d42fcf27194f1db61b7edf01 \
    --tuf-repo-output-dir-a-install /data/local/env/berlin/sean-a \
    --tuf-repo-output-dir-b-upgrade /data/local/env/berlin/sean-b
```

This upgrades from
Will report back when that update is installed. Should be able to reboot a Nexus after that whole process too.

EDIT: Picked the wrong target commit; merging and trying again...
Okay, took me a couple times, because typing the right commit hashes is apparently a daunting task. Getting the wrong commit certainly showed me the schema update is running, and still doesn't want to downgrade! From
That test covers "main -> the addition of this PR". I'm also going to upgrade using:
To cover the case of "actually do a schema update", which will come from "a few commits ago -> the addition of this PR". Note that 3be1d57 was before the chain of
Okay running:
Also succeeds.
Split off of #8845
Reads records on boot, validates access.
Depends on queries from #8932
Fixes #8501